Mentor - Mr. Rohit Raj
Members
For industries around the world, workplace accidents are a major concern: they affect the lives and well-being of employees, contractors, and their families, and the industry faces losses in terms of hospital charges, litigation fees, reputation, and lost employee morale. Based on these facts, we intend to build a chatbot that can highlight the safety risk implied by an incident description for professionals including:
1. Personnel from the safety and compliance team
2. Senior management from the plant
3. Personnel from other plants across the globe
4. Government and industrial safety groups
5. Anyone interested in or researching industrial safety
6. Emergency health and safety teams
7. Fire safety and industrial hazard teams
8. General management
9. Other personnel requiring safety risk information
so that these professionals can:
- Take preventive and proactive measures based on past history
- React faster to employee concerns related to safety
- Help position equipment and machinery in safe locations where the risk of potential accidents can be minimised
- Gain insights about safety in industries where safety is paramount
- Reduce insurance costs through better handling of personnel, equipment, and other resources
- Take other safety-related decisions and actions
The user should be able to input an incident description, and the chatbot should predict the potential accident or vulnerability level; this capability can be extended or configured for different scenarios.
The dataset describes accident incidents from twelve plants across three countries and consists of 425 records. It has the following columns:
Date: timestamp or time/date information
Countries: The country where the accident occurred (anonymised)
Local: The city where the manufacturing plant is located (anonymised)
Industry sector: Which sector the plant belongs to
Accident level: From I to VI, it registers how severe the accident was (I means not severe, VI means very severe)
Potential Accident Level: From I to VI, depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)
Gender: Whether the person involved is male or female
Employee or Third Party: If the injured person is an employee or a third party / contractor
Critical Risk: Description of the risk involved in the accident
Description: Detailed description of how the accident happened
On inspection of the dataset it appears that:
1. The dataset is limited to 425 records, so training models to high accuracy could be a challenge
2. The dataset is imbalanced on certain variables such as Potential Accident Level and Accident Level, so we may not get consistent results unless the dataset is treated to reduce the imbalance
3. Minor accidents are more common than major accidents, which mirrors real-world situations
4. There is data from three countries
5. There are twelve locals or cities from which the data is taken
6. There are three industry sectors - mining, metals, and all others grouped together as others
7. There are five accident levels
8. There are six potential accident levels
9. There are employees, third parties, and remote third parties involved in the accidents
10. There are thirty-three different types of critical risk, one of which is assigned to each accident incident
11. The accident descriptions are highly unclean, so a considerable cleaning effort will be required to produce results
12. The dataset consists of data from January 2016 to July 2017
13. Males are involved in accidents more often than females; this also mirrors real-world situations, as considerably fewer females work in industrial environments
Approach - We have agreed on designing a chatbot capability using Slack as the UI, integrating with RASA and an API that triggers the underlying NLP model that gets built.
We have established agreed intermediate goals and progressed through the process steps below.
As part of building the NLP model we have adopted the following process steps:
Data processing techniques:
- Data cleansing
- Feature engineering
- Lemmatizing and stemming
- Removing stop words and applying GloVe embeddings
- Data visualization with charts, to see clearly how the data is spread across different dimensions, using univariate, bivariate, and multivariate analysis

Model designing - As part of model designing we have designed and trained the following models:
- Random Forest, Gradient Boosting, Logistic Regression, and SVM
- Neural network classifiers such as RNN, LSTM, Bi-directional LSTM, and FastText

We are fine-tuning and evaluating the best-performing model to be shipped behind the API that gets triggered from the Slack user interface.

Findings - From the data analysis we could infer that:
- Many body-related actions and accidents have been found
- A lot of equipment-related accidents are cited in the dataset
- Poor features, with a lack of quality or inadequate data, result in class imbalance

Since the data shows that the recorded accident severity is often low even when the critical risk is high, we will have to consider both Accident Level and Potential Accident Level for the model prediction.
is_on_colab = False # set this variable based on current environment
# Basic packages
import pandas as pd, numpy as np, seaborn as sns, gc
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
import plotly
print('Plotly Version: ' + str(plotly.__version__))
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# The two environments differ only in the Plotly renderer
if is_on_colab:
    pio.renderers.default = 'colab'
else:
    pio.renderers.default = 'notebook'
# Importing the required libraries
from sklearn.impute import SimpleImputer
import spacy
# Models
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score, learning_curve
# Display settings
pd.options.display.max_rows = 400
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
random_state = 42
np.random.seed(random_state)
# importing os for setting path
import os
# set this as your working directory
working_dir = 'E:\\Great Learning\\DL\\Capstone\\Data\\'
# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
Plotly Version: 5.3.1
if is_on_colab:
    from google.colab import drive
    drive.mount('/content/drive')
# Project path
os.chdir(working_dir)
# Loading the Data from drive
df = pd.read_csv('Data Set - industrial_safety_and_health_database_with_accidents_description.csv')
print(df.shape)
df.head()
(425, 11)
| | Unnamed: 0 | Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee or Third Party | Critical Risk | Description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... |
| 1 | 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... |
| 2 | 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... |
| 3 | 3 | 2016-01-08 00:00:00 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... |
| 4 | 4 | 2016-01-10 00:00:00 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... |
# Dropping the unwanted column
ds = df.copy()
ds.drop(columns='Unnamed: 0', inplace=True)
# Checking the information of the Dataset
ds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425 entries, 0 to 424
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Data                      425 non-null    object
 1   Countries                 425 non-null    object
 2   Local                     425 non-null    object
 3   Industry Sector           425 non-null    object
 4   Accident Level            425 non-null    object
 5   Potential Accident Level  425 non-null    object
 6   Genre                     425 non-null    object
 7   Employee or Third Party   425 non-null    object
 8   Critical Risk             425 non-null    object
 9   Description               425 non-null    object
dtypes: object(10)
memory usage: 33.3+ KB
# Displaying the columns
ds.columns
Index(['Data', 'Countries', 'Local', 'Industry Sector', 'Accident Level',
'Potential Accident Level', 'Genre', 'Employee or Third Party',
'Critical Risk', 'Description'],
dtype='object')
# Renaming the Features of the Dataset
ds.rename(columns= {'Data':'Date', 'Countries':'Country', 'Genre':'Gender',
'Employee or Third Party':'Employee type'}, inplace =True)
ds.head()
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... |
| 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... |
| 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... |
| 3 | 2016-01-08 00:00:00 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... |
| 4 | 2016-01-10 00:00:00 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... |
# Null value check
pd.DataFrame(ds.isnull().sum(), columns=['Missing value'])
| | Missing value |
|---|---|
| Date | 0 |
| Country | 0 |
| Local | 0 |
| Industry Sector | 0 |
| Accident Level | 0 |
| Potential Accident Level | 0 |
| Gender | 0 |
| Employee type | 0 |
| Critical Risk | 0 |
| Description | 0 |
ds.duplicated().sum()
7
ds.drop_duplicates(subset='Description', inplace=True, keep=False)  # keep=False drops every copy of a duplicated description
ds.shape
(399, 10)
ds['Date'] = pd.to_datetime(ds['Date']) # Creating the new feature Date to analyse the Accident
ds['Year'] = ds['Date'].apply(lambda x: x.year) # Creating the new feature Year to analyse the Accident
ds['Month'] = ds['Date'].apply(lambda x: x.month) # Creating the new feature Month to analyse the Accident
ds['Day'] = ds['Date'].apply(lambda x: x.day) # Creating the new feature Day to analyse the Accident
ds['Weekday'] = ds['Date'].apply(lambda x: x.day_name()) # Creating the new feature Weekday to analyse the Accident
ds.head()
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... | 2016 | 1 | 6 | Wednesday |
| 3 | 2016-01-08 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... | 2016 | 1 | 8 | Friday |
| 4 | 2016-01-10 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... | 2016 | 1 | 10 | Sunday |
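The per-row `apply` calls above work, but the same date features can be derived with pandas' vectorized `.dt` accessor, which is the more idiomatic form. A minimal sketch on a toy frame (`tmp` is a stand-in for `ds`):

```python
import pandas as pd

# Toy frame standing in for ds; the notebook parses 'Date' with pd.to_datetime first
tmp = pd.DataFrame({'Date': pd.to_datetime(['2016-01-01', '2017-07-09'])})

# Vectorized datetime components instead of row-wise lambdas
tmp['Year'] = tmp['Date'].dt.year
tmp['Month'] = tmp['Date'].dt.month
tmp['Day'] = tmp['Date'].dt.day
tmp['Weekday'] = tmp['Date'].dt.day_name()
```

The `.dt` accessor operates on the whole column at once, avoiding the Python-level loop that `apply` incurs.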
# Defining a function to create a new feature called Quater (quarter)
def month_quater_conversion(x):
    if x in [1, 2, 3]:
        season = 'First'
    elif x in [4, 5, 6]:
        season = 'Second'
    elif x in [7, 8, 9]:
        season = 'Third'
    else:
        season = 'Fourth'
    return season
ds['Quater'] = ds['Month'].apply(month_quater_conversion)
ds.head()
| | Date | Country | Local | Industry Sector | Accident Level | Potential Accident Level | Gender | Employee type | Critical Risk | Description | Year | Month | Day | Weekday | Quater |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... | 2016 | 1 | 1 | Friday | First |
| 1 | 2016-01-02 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... | 2016 | 1 | 2 | Saturday | First |
| 2 | 2016-01-06 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... | 2016 | 1 | 6 | Wednesday | First |
| 3 | 2016-01-08 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... | 2016 | 1 | 8 | Friday | First |
| 4 | 2016-01-10 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... | 2016 | 1 | 10 | Sunday | First |
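The month-to-quarter chain of `if`/`elif` above can also be expressed with integer arithmetic; `month_to_quarter` below is a hypothetical equivalent, not the notebook's function:

```python
def month_to_quarter(month):
    # Months 1-3 -> index 0, 4-6 -> 1, 7-9 -> 2, 10-12 -> 3
    return ['First', 'Second', 'Third', 'Fourth'][(month - 1) // 3]
```

Both forms produce identical labels for months 1 through 12; the arithmetic version simply removes the branching.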
# Converting the class of target to numeric
replace_value = {'I':1, 'IV':4, 'III':3, 'II':2, 'V':5}
ds['Accident Level'] = ds['Accident Level'].map(replace_value)
replace_value = {'IV':4, 'III':3, 'I':1, 'II':2, 'V':5, 'VI':5}  # VI is mapped to 5, merging it with V
ds['Potential Accident Level'] = ds['Potential Accident Level'].map(replace_value)
del replace_value
# Analysing the categorical features
cats = ['Country', 'Local', 'Industry Sector', 'Accident Level',
'Potential Accident Level', 'Gender', 'Employee type', 'Critical Risk',
'Year', 'Month', 'Day', 'Weekday', 'Quater']
# Histogram of Country
fig = px.histogram(ds, x = 'Country', width=800, height=500, category_orders=dict(Country = ['Country_01','Country_02', 'Country_03']))
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Local with country
fig = px.histogram(ds, x = 'Local',color = 'Country', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Industry Sector
fig = px.histogram(ds, x = 'Industry Sector', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Accident Level
fig = px.histogram(ds, x = 'Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram Target Variable Potential Accident Level
fig = px.histogram(ds, width=800, height=500, x ='Potential Accident Level')
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Gender
fig = px.histogram(ds, x = 'Gender', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Employee type
fig = px.histogram(ds, x = 'Employee type',width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Histogram of Critical Risk
fig = go.Figure(data = go.Histogram(y = ds['Critical Risk'].values))
fig.update_layout(bargap = .4)
fig.show()
# Histogram of Quater
fig = px.histogram(ds, x = 'Quater', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
# Sector which is most effected
fig = px.pie(ds, names='Industry Sector', template='seaborn')
fig.update_traces(rotation=90, pull=[0.2,0.03,0.1,0.03,0.1], textinfo="percent+label", showlegend=False)
fig.show()
# Potential Accident Level per country
fig = px.histogram(ds, x ='Country', color='Potential Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Country_01 is the most affected country, and most classes of Potential Accident Level belong to Country_01.
# Accident Level per country
fig = px.histogram(ds, x ='Country', color='Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Country_01 is the most affected country, and most classes of Accident Level belong to Country_01.
# Industry sector most effected by Potential Accident Level
fig = px.histogram(ds, x ='Industry Sector', color='Potential Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
The Mining sector is the most affected, and the most severe accident levels also belong to that sector.
# Potential Accident Level in each Quater
fig = px.histogram(ds, color= 'Potential Accident Level', x='Quater', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
The first and second quarters account for the higher accident levels, 4 and 5.
# Critical Risk vs Potential Accident Level
fig = px.histogram(ds, x ='Critical Risk', color='Potential Accident Level')
fig.update_layout(bargap = 0.2)
fig.show()
Most of the classes of Potential Accident Level come from the Others class of Critical Risk, which accounts for 232 records.
The most severe Potential Accident Levels come from the classes Fall, Electrical installation, Vehicles, Projection, Pressed, and Mobile equipment.
# Critical Risk vs Industry Sector
fig = px.histogram(ds, x ='Critical Risk', color='Industry Sector')
fig.update_layout(bargap = 0.2)
fig.show()
The Mining sector is the most affected sector, and most classes of Critical Risk come from this sector.
# Accident Level vs Potential Accident Level
fig = px.histogram(ds, color ='Potential Accident Level', x='Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Class 1 of Accident Level accounts for most of the accidents and spans all classes of Potential Accident Level (1 to 5).
# Employee type vs Potential Accident Level
fig = px.histogram(ds, color ='Potential Accident Level', x='Employee type', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Third Party and Employee are the most affected employee types.
# Industry sector vs Potential Accident Level, Gender and Accident Level
fig = px.bar(ds, x="Industry Sector", y="Accident Level", color="Gender", barmode="group", facet_col="Potential Accident Level")
fig.show()
Males are the most affected gender, and the Potential Accident Level 4 and 5 cases come largely from the Mining sector.
# Local with Employee type
fig = px.histogram(ds, color = 'Employee type', x ='Local', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Local_3 is the most affected city, and the most affected employee types there are Third Party and Employee.
# Local vs Industry Sector
fig = px.histogram(ds, color = 'Industry Sector', x ='Local', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Local 3 has the highest number of Mining sector accidents.
Local 5 has the highest number of Metals sector accidents.
All the Mining sector accidents happened in Locals 1, 2, 3, 4, and 7.
All the Metals sector accidents happened in Locals 5, 6, 8, and 9.
All the Others sector accidents happened in Locals 10, 11, and 12.
# Year vs Potential Accident Level
fig = px.histogram(ds, color = 'Year', x ='Potential Accident Level', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Most of the accidents happened in 2016, with fewer in 2017.
# Year vs Industry Sector
fig = px.histogram(ds, color = 'Year', x ='Industry Sector', width=800, height=500)
fig.update_layout(bargap = 0.2)
fig.show()
Most of the Mining accidents happened in 2016, with fewer in 2017.
# Sampling a random Description along with its labels
num = np.random.randint(0, ds.shape[0])
description = ds.loc[num, 'Description']
industry = ds.loc[num, 'Industry Sector']
accident_severity = ds.loc[num, 'Accident Level']
potential_severity = ds.loc[num, 'Potential Accident Level']
employee_type = ds.loc[num, 'Employee type']
critical_risk = ds.loc[num, 'Critical Risk']
# Checking the max Description length before cleaning
max_description_len = max([len(i.split()) for i in ds['Description']])
print('Max description length:', max_description_len)
Max description length: 183
# Checking the min Description length before cleaning
min_description_len = min([len(i.split()) for i in ds['Description']])
print('Min description length:', min_description_len)
Min description length: 16
# Reloading the Data from drive to restart the cleaning from the raw text
df = pd.read_csv('Data Set - industrial_safety_and_health_database_with_accidents_description.csv')
# Dropping the unwanted column
ds = df.copy()
ds.drop(columns='Unnamed: 0', inplace=True)
# Renaming the Features of the Dataset
ds.rename(columns= {'Data':'Date', 'Countries':'Country', 'Genre':'Gender',
'Employee or Third Party':'Employee type'}, inplace =True)
# Converting the class of target to numeric
replace_value = {'I':1, 'IV':4, 'III':3, 'II':2, 'V':5}
ds['Accident Level'] = ds['Accident Level'].map(replace_value)
replace_value = {'IV':4, 'III':3, 'I':1, 'II':2, 'V':5, 'VI':5}  # VI is mapped to 5, merging it with V
ds['Potential Accident Level'] = ds['Potential Accident Level'].map(replace_value)
del replace_value
# Removing HTML tags
from bs4 import BeautifulSoup
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()
# Removing accented characters
import unicodedata
def remove_accented_chars(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
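The NFKD round-trip used above decomposes each accented character into a base letter plus a combining mark, then drops the non-ASCII mark. A standalone restatement (`strip_accents` is a hypothetical helper name):

```python
import unicodedata

def strip_accents(text):
    # Decompose accented characters, drop the non-ASCII combining marks, decode back
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
```

This matters for this dataset because the descriptions appear to originate from Spanish- and Portuguese-speaking plants, so accented characters are common.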
# Remove special characters
import re
def remove_special_characters(text, remove_digits=False):
    # Keep letters, whitespace, and (optionally) digits; note A-Z must be uppercase
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
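One subtlety with character-class ranges here: `a-zA-z` (lowercase final `z`) spans codepoints 65 through 122, so it also matches `[`, `\`, `]`, `^`, `_`, and the backtick, letting those characters survive the filter; `a-zA-Z` is the safe form. A quick demonstration:

```python
import re

# 'A-z' spans from 'A' (65) through 'z' (122), accidentally including [ \ ] ^ _ `
buggy = re.compile(r'[^a-zA-z\s]')
fixed = re.compile(r'[^a-zA-Z\s]')

sample = 'pump_3 [unit B]'
```

With the buggy class, `_`, `[`, and `]` are kept; with the corrected class they are stripped along with the digit.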
# This function removes punctuation, including the curly apostrophe not covered by string.punctuation
import string
def remove_punctuations(text):
    text_wo_punctuations = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    curly_apostrophe = '’'
    return re.sub('[%s]' % re.escape(curly_apostrophe), '', text_wo_punctuations)
# This function removes non-English words (uses nltk, imported below before this is called)
def english_text(text):
    words = set(nltk.corpus.words.words())
    english = " ".join(word for word in nltk.wordpunct_tokenize(text)
                       if word.lower() in words or not word.isalpha())
    return english
# This function removes numbers
def take_out_numbers(text):
    return re.sub(r"\d+", ' ', text)
# Lemmatization
import nltk
nltk.download('words')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer, PorterStemmer
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
# Function to clean the text
def normalize_corpus(doc, html_stripping=True, accented_char_removal=True, text_lower_case=True,
                     special_char_removal=True, stopword_removal=True, remove_digits=True):
    # strip HTML
    if html_stripping:
        doc = strip_html_tags(doc)
    if accented_char_removal:
        doc = remove_accented_chars(doc)
    # lowercase the text
    if text_lower_case:
        doc = doc.lower()
    # remove extra newlines (inside a character class '|' is a literal, so [\r\n]+ suffices)
    doc = re.sub(r'[\r\n]+', ' ', doc)
    # remove punctuation, digits, and non-English words
    if special_char_removal:
        doc = remove_punctuations(doc)
        doc = take_out_numbers(doc)
        doc = english_text(doc)
    # remove extra whitespace
    doc = re.sub(' +', ' ', doc)
    return doc
# Applying the function to feature Description
ds['clean_Description'] = ds['Description'].apply(lambda x: normalize_corpus(x))
print(ds['clean_Description'][0])
while removing the drill rod of the jumbo for maintenance the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal seeing this the mechanic one end on the drill of the equipment to pull with both the bar and accelerate the removal from this at this moment the bar from its point of support and the of the mechanic between the drilling bar and the beam of the jumbo
for dependency in ("brown", "names", "wordnet", "averaged_perceptron_tagger", "universal_tagset", 'stopwords', 'punkt', 'words'):
    nltk.download(dependency)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
lemmatizer = WordNetLemmatizer()
# This function lemmatizes the text, mapping each POS tag to its WordNet tag
def lemmatize_words(text):
    lemmatized_text = ''
    for word, tag in pos_tag(word_tokenize(text)):
        wnltag = tag[0].lower()
        wnltag = wnltag if wnltag in ['a', 'r', 'n', 'v'] else None
        if not wnltag:
            lemma = word
        else:
            lemma = lemmatizer.lemmatize(word, wnltag)
        lemmatized_text = lemmatized_text + ' ' + lemma
    return lemmatized_text.lstrip()
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\dsJohn\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
# Removing single characters surrounded by whitespace
def remove_single_char(text):
    pattern = r'\s+[a-zA-Z]\s+'
    text = re.sub(pattern, ' ', text)
    return text
ds['clean_Description'] = ds['clean_Description'].apply(lambda x: remove_single_char(x))
# This function removes words of one or two characters
def two_character(text):
    pattern = r'\W*\b\w{1,2}\b'
    text = re.sub(pattern, '', text)
    return text
# Applying the function to remove the word with two characters
ds['clean_Description'] = ds['clean_Description'].apply(lambda x: two_character(x))
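The `\W*\b\w{1,2}\b` pattern removes one- and two-character words together with the separator in front of them, which is why stray tokens such as "nv" or "am" disappear from the descriptions. A standalone restatement (`drop_short_words` is a hypothetical helper name):

```python
import re

def drop_short_words(text):
    # \W* eats the preceding separator, \b\w{1,2}\b matches a 1- or 2-character word
    return re.sub(r'\W*\b\w{1,2}\b', '', text)
```

Note that the preceding whitespace is consumed along with the short word, so no double spaces are left behind.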
# Applying the function to lemmatize the word
ds['clean_Description'] = ds['clean_Description'].apply(lambda x: lemmatize_words(x))
# Randomly visualizing the clean corpus
num = np.random.randint(0, ds.shape[0])
clean_desc = ds.loc[num, 'clean_Description']
clas = ds.loc[num, 'Potential Accident Level']
print(clas)
print(' ')
print(clean_desc)
3

be when the collaborator sampler be change and remove the from the pulp the plant courier slip and fell the ground support himself with the right hand generate the lesion
# Counting part-of-speech tags with spaCy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
nlp = spacy.load('en_core_web_sm')  # load the small English model used by count_pos below
def count_pos(text):
    doc = nlp(str(text))
    counts_dict = doc.count_by(spacy.attrs.IDS['POS'])
    for pos, count in counts_dict.items():
        human_readable_tag = doc.vocab[pos].text
        print(human_readable_tag, count)
ds['Description_length'] = [len(i.split()) for i in ds['clean_Description']]
max_description_len = max([len(i.split()) for i in ds['clean_Description']])
print('Max description length:', max_description_len)
Max description length: 133
min_description_len = min([len(i.split()) for i in ds['clean_Description']])
print('Min description length:', min_description_len)
Min description length: 10
# Checking the mean length of clean_Description after cleaning
Mean_description_len = ds['Description_length'].mean()
print('Mean description length:', Mean_description_len)
Mean description length: 44.52705882352941
# Analysing the N-Grams
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
def get_top_ngram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:10]
def plot_top_ngrams_barchart(text, n=2):
    top_n_grams = get_top_ngram(text, n)
    x, y = map(list, zip(*top_n_grams))
    fig = px.bar(x=x, y=y, width=800, height=500)
    fig.update_layout(bargap=0.2)
    fig.show()
# Plotting the top bigrams (2-grams)
plot_top_ngrams_barchart(ds['clean_Description'], 2)
# Plotting the top trigrams (3-grams)
plot_top_ngrams_barchart(ds['clean_Description'], 3)
# Plotting the top 4-grams
plot_top_ngrams_barchart(ds['clean_Description'], 4)
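What `CountVectorizer(ngram_range=(n, n))` tallies can be sketched in plain Python with `collections.Counter` (a simplified stand-in using whitespace tokenization rather than sklearn's tokenizer; `top_ngrams` is a hypothetical helper name):

```python
from collections import Counter

def top_ngrams(docs, n=2, k=10):
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        # Slide a window of length n over each document's tokens
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts.most_common(k)
```

For example, `top_ngrams(['operator slip and fall', 'operator slip from ladder'], n=2, k=1)` surfaces "operator slip" as the most frequent bigram.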
# Defining the input and target variable
X = ds['clean_Description']
y = ds['Potential Accident Level']
# Vectorizing the cleaned text with TF-IDF; the train/test split follows after resampling
from sklearn.model_selection import train_test_split
cvt = TfidfVectorizer(ngram_range=(1,2), analyzer='word', min_df=5, sublinear_tf=True)
Xc = cvt.fit_transform(X).toarray()
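The `sublinear_tf=True` option replaces the raw term count `tf` with `1 + log(tf)`, damping the influence of words repeated many times within one description. A minimal sketch of just that weighting (ignoring sklearn's IDF smoothing and L2 normalization; `sublinear_tf` here is a hypothetical standalone function):

```python
import math

def sublinear_tf(tf):
    # sklearn applies 1 + log(tf) to nonzero counts when sublinear_tf=True
    return 1.0 + math.log(tf) if tf > 0 else 0.0
```

A term occurring once keeps weight 1.0, while a term occurring `e` times gets only 2.0, so frequency grows logarithmically rather than linearly.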
# Applying SMOTE to over sample the Minority class and then to under sample the Majority class
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from matplotlib import pyplot
from numpy import where
over = SMOTE(sampling_strategy={1: 140,2: 140, 3: 140, 5: 140})
under = RandomUnderSampler(sampling_strategy={4: 135})
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(Xc, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
Counter({1: 140, 2: 140, 3: 140, 5: 140, 4: 135})
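SMOTE's core idea can be sketched in a few lines: each synthetic sample is drawn on the line segment between a minority-class point and one of its nearest minority-class neighbours (a simplified sketch with numpy; imblearn actually selects neighbours via k-NN, and `synthesize` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(42)

def synthesize(sample, neighbor):
    # New point = sample + gap * (neighbor - sample), with gap drawn from [0, 1)
    gap = rng.random()
    return sample + gap * (neighbor - sample)

a = np.array([0.0, 0.0])
b = np.array([1.0, 2.0])
s = synthesize(a, b)
```

Because the new point lies strictly between existing minority samples, SMOTE interpolates rather than duplicating, which is why it tends to generalize better than plain random oversampling.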
# Train & Test split for the ML models: Random Forest, Gradient Boosting, Logistic Regression, and Linear SVM
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state =1, shuffle=True)
print(X_train.shape)
print(y_train.shape)
print(' ')
print(X_test.shape)
print(y_test.shape)
(556, 1073)
(556,)

(139, 1073)
(139,)
from collections import Counter
from matplotlib import pyplot
counter = Counter(y)
for k, v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))
# plot the distribution
pyplot.bar(counter.keys(), counter.values())
pyplot.show()
Class=1, n=140 (20.144%)
Class=2, n=140 (20.144%)
Class=3, n=140 (20.144%)
Class=4, n=135 (19.424%)
Class=5, n=140 (20.144%)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
rfcl = RandomForestClassifier(random_state=1)
rfcl.fit(X_train, y_train)
RandomForestClassifier(random_state=1)
ytest_pred = rfcl.predict(X_test)
acc_rfc = accuracy_score(y_test, ytest_pred)
acc_rfc_tr = rfcl.score(X_train,y_train)
print("Train Accuracy of the Random Forest model : {:.2f}".format(acc_rfc_tr*100))
print("Test Accuracy of the Random Forest model : {:.2f}".format(acc_rfc*100))
Train Accuracy of the Random Forest model : 100.00
Test Accuracy of the Random Forest model : 69.78
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100)
gbc.fit(X_train, y_train)
ytest_pred_gbc = gbc.predict(X_test)
acc_gbc = accuracy_score(y_test, ytest_pred_gbc)
acc_gbc_tr = gbc.score(X_train,y_train)
print(" Test accuracy of the Gradient boosting model : {:.2f}".format(acc_gbc*100))
print("Train accuracy of the Gradient boosting model : {:.2f}".format(acc_gbc_tr*100))
Test accuracy of the Gradient boosting model : 64.75
Train accuracy of the Gradient boosting model : 100.00
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(class_weight='balanced', max_iter=1000)
lr.fit(X_train, y_train)
ytest_pred = lr.predict(X_test)
acc_lr = accuracy_score(y_test, ytest_pred)
acc_lr_tr = lr.score(X_train, y_train)
print(" Test accuracy of the LR model : {:.2f}".format(acc_lr*100))
print("Train accuracy of the LR model : {:.2f}".format(acc_lr_tr*100))
Test accuracy of the LR model : 67.63
Train accuracy of the LR model : 100.00
from sklearn.svm import LinearSVC
svc = LinearSVC( max_iter=5000)
svc.fit(X_train, y_train)
ytest_pred = svc.predict(X_test)
# Evaluation
acc_svc = accuracy_score(y_test, ytest_pred)
acc_svc_tr = svc.score(X_train, y_train)
print("Train accuracy of the SVC model : {:.2f}".format(acc_svc_tr*100))
print("Test accuracy of the SVC model : {:.2f}".format(acc_svc*100))
Train accuracy of the SVC model : 100.00
Test accuracy of the SVC model : 74.10
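With only 139 test samples, a single split is a noisy basis for ranking these models. As a hedged sketch (again on synthetic stand-in data, not the notebook's features), 5-fold cross-validation gives a steadier comparison:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# synthetic stand-in for the vectorised descriptions
Xs, ys = make_classification(n_samples=500, n_features=40, n_informative=8,
                             n_classes=5, n_clusters_per_class=1, random_state=1)

for name, model in [("LinearSVC", LinearSVC(max_iter=5000)),
                    ("LogReg", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, Xs, ys, cv=5)  # 5 held-out accuracies
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```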
# Printing the performance metrics
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
def print_confusion_matrix(y_test, ytest_predict):
    cm = confusion_matrix(y_test, ytest_predict)
    cm = pd.DataFrame(cm)
    plt.figure(figsize=(4, 4))
    sns.set()
    sns.heatmap(cm.T, square=True, fmt='', annot=True, cbar=False,
                xticklabels=['1', '2', '3', '4', '5'],
                yticklabels=['1', '2', '3', '4', '5']).set_title('Confusion Matrix')
    plt.xlabel('True label')
    plt.ylabel('Predicted label')
    plt.show()
print_confusion_matrix(y_test, ytest_pred_gbc)
print(classification_report(y_test, ytest_pred_gbc))
              precision    recall  f1-score   support

           1       1.00      0.76      0.87        38
           2       0.50      0.71      0.59        21
           3       0.53      0.59      0.56        29
           4       0.36      0.33      0.35        27
           5       0.87      0.83      0.85        24

    accuracy                           0.65       139
   macro avg       0.65      0.65      0.64       139
weighted avg       0.68      0.65      0.66       139
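The macro and weighted averages in the report can be reproduced by hand from the per-class f1 scores and supports, which makes the distinction concrete: macro treats all five classes equally, while weighted favours the better-predicted, larger classes:

```python
# per-class f1 and support as reported above
f1 = [0.87, 0.59, 0.56, 0.35, 0.85]
support = [38, 21, 29, 27, 24]

macro = sum(f1) / len(f1)                                      # unweighted mean
weighted = sum(f * s for f, s in zip(f1, support)) / sum(support)
print(round(macro, 2), round(weighted, 2))  # 0.64 0.66
```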
import tensorflow
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, Dropout, Flatten, GlobalAveragePooling1D, BatchNormalization, LSTM, GlobalMaxPooling1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.backend import clear_session
from tensorflow.keras.initializers import Constant
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import TimeDistributed
X = ds['clean_Description']
# Converting the target to one-hot encoding for the Keras models
y = pd.get_dummies(ds['Potential Accident Level']).values
y[0]
array([0, 0, 0, 1, 0], dtype=uint8)
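pd.get_dummies orders its columns lexicographically, so y[0] = [0, 0, 0, 1, 0] marks the fourth sorted category. A small sketch with hypothetical level labels shows this column order and how argmax maps a one-hot row back to its label:

```python
import numpy as np
import pandas as pd

# hypothetical accident levels, not the notebook's actual column values
levels = pd.Series(['IV', 'I', 'III', 'I', 'V'])
onehot = pd.get_dummies(levels)
print(list(onehot.columns))  # lexicographically sorted categories

# invert the one-hot encoding: argmax picks the hot column per row
decoded = onehot.columns[np.argmax(onehot.values, axis=1)]
print(list(decoded))
```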
# Splitting the data for the neural network models
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state =1, shuffle=True)
print(X_train.shape)
print(y_train.shape)
print(' ')
print(X_test.shape)
print(y_test.shape)
(340,)
(340, 5)

(85,)
(85, 5)
# Defining the tokenizer parameters and word indices
max_features = 10000 # cap the vocabulary at the 10000 most frequent words
tokenizer = Tokenizer(num_words= max_features)
# Fitting the tokenizer on Training input feature
tokenizer.fit_on_texts(X_train.tolist())
print(tokenizer.word_index) # Words with its index
{'the': 1, 'and': 2, 'be': 3, 'that': 4, 'with': 5, 'when': 6, 'his': 7, 'from': 8, 'cause': 9, 'hand': 10, 'right': 11, 'employee': 12, 'left': 13, 'operator': 14, 'for': 15, 'time': 16, 'which': 17, 'injury': 18, 'work': 19, 'activity': 20, ..., 'snack': 1678, 'particle': 1679}
len(tokenizer.word_index)
1679
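The index above is frequency-ranked: the most common word ('the') gets id 1. A minimal pure-Python sketch of the same ordering (a toy stand-in, not Keras's Tokenizer itself):

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, most frequent first -- mirrors the Tokenizer's ordering."""
    counts = Counter(w for t in texts for w in t.split())
    # most_common is a stable sort, so ties keep first-encounter order
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

texts = ["the operator cut the hand", "the hand injury"]
print(build_word_index(texts))
# 'the' occurs 3 times -> id 1, 'hand' twice -> id 2, the rest follow
```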
# A sample description before tokenization
print(X_train[10])
while segment the pulley protective weigh the head pulley the ore winch when the pulley rotate compress the inside the channel from its housing rub the right side the hip generate the injury
X_train = tokenizer.texts_to_sequences(X_train.tolist())
# The same description after tokenization
print(X_train[10])
[1, 731, 494, 93, 1, 364, 181, 68, 732, 3, 31, 101, 1, 1, 1014, 365, 50, 32, 1, 495, 731, 54, 107, 1, 732, 255, 5, 1, 255, 256, 4, 24, 1, 93, 69, 119, 1, 494, 2, 74, 3, 182, 1, 494]
X_test = tokenizer.texts_to_sequences(X_test.tolist())
# Maximum number of words to consider in each text (max_description_len was computed earlier in the notebook)
maxlen = max_description_len
# Pad training text
X_train = pad_sequences(X_train, maxlen= maxlen, padding='pre', truncating='post')
# Pad testing text
X_test = pad_sequences(X_test, maxlen= maxlen, padding='pre', truncating='post')
print(X_train.shape)
(340, 133)
num = np.random.randint(0, X_train.shape[0])
print(X_train[10])
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 731 494 93 1 364 181 68 732
3 31 101 1 1 1014 365 50 32 1 495 731 54 107
1 732 255 5 1 255 256 4 24 1 93 69 119 1
494 2 74 3 182 1 494]
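Padded sequences can be mapped back to words by inverting the word index and skipping the padding id 0. A sketch with a hypothetical three-word vocabulary (the real one is tokenizer.word_index):

```python
def decode(seq, word_index):
    """Map a padded id sequence back to words, skipping the 0 padding id."""
    index_word = {i: w for w, i in word_index.items()}
    return ' '.join(index_word[i] for i in seq if i != 0)

# hypothetical mini-vocabulary for illustration only
wi = {'the': 1, 'pulley': 2, 'rotate': 3}
print(decode([0, 0, 1, 2, 3], wi))  # -> the pulley rotate
```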
import gensim
import gensim.downloader as api
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import Word2Vec, KeyedVectors
# GloVe file - we are using the model with 200-dimensional embeddings
glove_input_file = working_dir + 'glove.6B.200d.txt'
# Name for word2vec file
word2vec_output_file = working_dir + 'glove.6B.200d.txt.word2vec'
# Converting the GloVe embeddings to word2vec format
glove2word2vec(glove_input_file, word2vec_output_file)
(400000, 200)
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
# Shape of the GloVe vectors (vocabulary size, embedding dimension)
glove_model.vectors.shape
(400000, 200)
# Dimensionality of the pre-trained embedding vectors
embedding_vector_length = glove_model.vector_size
embedding_vector_length
200
vocab_size = len(tokenizer.word_index)+1
vocab_size
1680
num_words = min(max_features, vocab_size)
num_words
1680
embedding_matrix = np.zeros((num_words, embedding_vector_length))
embedding_matrix.shape
(1680, 200)
# Loading the GloVe vector for each word in the tokenizer's vocabulary
for word, i in sorted(tokenizer.word_index.items(), key=lambda x: x[1]):
    if i >= num_words:
        break
    try:
        embedding_matrix[i] = glove_model[word]
    except KeyError:
        # words absent from GloVe keep their all-zero row
        pass
# Embedding matrix shape
embedding_matrix.shape
(1680, 200)
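Words missing from GloVe keep their all-zero rows, so it can be worth checking how much of the vocabulary the pre-trained embeddings actually cover. A self-contained sketch of the same fill-and-count loop, using toy stand-ins (assumptions) for `tokenizer.word_index` and the GloVe model:

```python
import numpy as np

# Toy stand-ins for tokenizer.word_index and the GloVe KeyedVectors
word_index = {'forklift': 1, 'ladder': 2, 'zxqv': 3}   # 'zxqv' is out-of-vocabulary
toy_glove = {'forklift': np.ones(4), 'ladder': np.full(4, 2.0)}

num_words, dim = len(word_index) + 1, 4
embedding_matrix = np.zeros((num_words, dim))
found = 0
for word, i in word_index.items():
    vec = toy_glove.get(word)          # None for out-of-vocabulary words
    if vec is not None:
        embedding_matrix[i] = vec
        found += 1

print(f'GloVe coverage: {found}/{len(word_index)} words')  # 2/3
```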
# embedded data
num = np.random.randint(0, embedding_matrix.shape[0])
# Initializing the model
clear_session()
nn_model = Sequential()
# Embedding layer
nn_model.add(Embedding(input_dim= num_words, output_dim= embedding_vector_length,
weights = [embedding_matrix],
trainable = False,
input_length = maxlen))
nn_model.output
<tf.Tensor 'embedding/embedding_lookup/Identity_1:0' shape=(None, 133, 200) dtype=float32>
The Embedding layer gives us a 3D output -> [batch_size, sequence_length, embedding_size]
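Flatten collapses each (133, 200) sample into a single 26,600-dimensional vector so the Dense layers can consume it. A small numpy sketch of the shape change:

```python
import numpy as np

batch = np.zeros((2, 133, 200))           # (batch, sequence, embedding)
flat = batch.reshape(batch.shape[0], -1)  # what Flatten does per sample
print(flat.shape)                         # (2, 26600), since 133 * 200 = 26600
```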
# Flatten the 3D output as the following Dense layers expect 2D input
nn_model.add(Flatten())
# Adding Hidden Layers(Dense layers)
nn_model.add(Dense(100, activation='relu'))
nn_model.add(Dropout(0.4))
nn_model.add(BatchNormalization())
nn_model.add(Dense(50, activation='relu'))
nn_model.add(Dropout(0.4))
nn_model.add(BatchNormalization())
nn_model.add(Dense(25, activation='relu'))
nn_model.add(Dropout(0.4))
# Adding output layer
nn_model.add(Dense(5, activation='softmax'))
nn_model.output
<tf.Tensor 'dense_3/Softmax:0' shape=(None, 5) dtype=float32>
# Compiling the model
nn_model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
nn_model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 133, 200) 336000 _________________________________________________________________ flatten (Flatten) (None, 26600) 0 _________________________________________________________________ dense (Dense) (None, 100) 2660100 _________________________________________________________________ dropout (Dropout) (None, 100) 0 _________________________________________________________________ batch_normalization (BatchNo (None, 100) 400 _________________________________________________________________ dense_1 (Dense) (None, 50) 5050 _________________________________________________________________ dropout_1 (Dropout) (None, 50) 0 _________________________________________________________________ batch_normalization_1 (Batch (None, 50) 200 _________________________________________________________________ dense_2 (Dense) (None, 25) 1275 _________________________________________________________________ dropout_2 (Dropout) (None, 25) 0 _________________________________________________________________ dense_3 (Dense) (None, 5) 130 ================================================================= Total params: 3,003,155 Trainable params: 2,666,855 Non-trainable params: 336,300 _________________________________________________________________
# Using callbacks to stop training when the loss stops decreasing and accuracy stops improving
early = EarlyStopping(monitor='val_loss', patience=7, verbose=1, min_delta=0.0001, mode='auto')
reduce_learning = ReduceLROnPlateau(patience=5, verbose=1, min_lr=1e-6, factor=0.2)
callback_list = [early, reduce_learning]
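EarlyStopping waits for `patience=7` epochs without a `min_delta` improvement in `val_loss` before halting. A minimal pure-Python sketch of that logic (not the Keras implementation), replayed against the per-epoch validation losses from the training log:

```python
def epochs_run(val_losses, patience, min_delta=1e-4):
    """Return how many epochs run before early stopping triggers."""
    best, wait = float('inf'), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:      # a real improvement resets the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# val_loss per epoch from the log below; best is 1.4299 at epoch 4
val_losses = [1.4781, 1.4593, 1.4437, 1.4299, 1.4315, 1.4385,
              1.4781, 1.5004, 1.4932, 1.4894, 1.4921]
print(epochs_run(val_losses, patience=7))  # 11, matching 'Epoch 00011: early stopping'
```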
nn_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32, callbacks=callback_list)
Epoch 1/100 11/11 [==============================] - 1s 59ms/step - loss: 2.1875 - accuracy: 0.1853 - val_loss: 1.4781 - val_accuracy: 0.3647 Epoch 2/100 11/11 [==============================] - 0s 22ms/step - loss: 1.8765 - accuracy: 0.2559 - val_loss: 1.4593 - val_accuracy: 0.4118 Epoch 3/100 11/11 [==============================] - 0s 25ms/step - loss: 1.6950 - accuracy: 0.3059 - val_loss: 1.4437 - val_accuracy: 0.3882 Epoch 4/100 11/11 [==============================] - 0s 26ms/step - loss: 1.6189 - accuracy: 0.3235 - val_loss: 1.4299 - val_accuracy: 0.4471 Epoch 5/100 11/11 [==============================] - 0s 24ms/step - loss: 1.6389 - accuracy: 0.3324 - val_loss: 1.4315 - val_accuracy: 0.4353 Epoch 6/100 11/11 [==============================] - 0s 24ms/step - loss: 1.5096 - accuracy: 0.3559 - val_loss: 1.4385 - val_accuracy: 0.3882 Epoch 7/100 11/11 [==============================] - 0s 23ms/step - loss: 1.5044 - accuracy: 0.3618 - val_loss: 1.4781 - val_accuracy: 0.3294 Epoch 8/100 11/11 [==============================] - 0s 24ms/step - loss: 1.3932 - accuracy: 0.4235 - val_loss: 1.5004 - val_accuracy: 0.2941 Epoch 9/100 10/11 [==========================>...] - ETA: 0s - loss: 1.3748 - accuracy: 0.4281 Epoch 00009: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026. 11/11 [==============================] - 0s 24ms/step - loss: 1.3880 - accuracy: 0.4206 - val_loss: 1.4932 - val_accuracy: 0.2824 Epoch 10/100 11/11 [==============================] - 0s 26ms/step - loss: 1.3811 - accuracy: 0.4147 - val_loss: 1.4894 - val_accuracy: 0.3059 Epoch 11/100 11/11 [==============================] - 0s 23ms/step - loss: 1.3608 - accuracy: 0.4500 - val_loss: 1.4921 - val_accuracy: 0.2706 Epoch 00011: early stopping
<tensorflow.python.keras.callbacks.History at 0x219ec110280>
# Printing the performance metrics
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
def print_confusion_matrix(y_test, ytest_predict):
    cm = confusion_matrix(y_test, ytest_predict)
    cm = pd.DataFrame(cm)
    plt.figure(figsize=(4, 4))
    sns.set()
    sns.heatmap(cm.T, square=True, fmt='', annot=True, cbar=False, cmap='plasma',
                xticklabels=['1', '2', '3', '4', '5'],
                yticklabels=['1', '2', '3', '4', '5']).set_title('Confusion Matrix')
    plt.xlabel('True label')
    plt.ylabel('Predicted label')
    plt.show()
ytest_predict = nn_model.predict(X_test)
ytest_predict_binary = ytest_predict >= 0.5
print_confusion_matrix(y_test.argmax(axis=1), ytest_predict.argmax(axis=1))
print(classification_report(y_test.argmax(axis=1), ytest_predict.argmax(axis=1), target_names=['1','2','3','4','5']))
precision recall f1-score support
1 0.50 0.10 0.17 10
2 0.29 0.08 0.12 25
3 0.17 0.64 0.27 14
4 0.46 0.33 0.39 33
5 0.00 0.00 0.00 3
accuracy 0.27 85
macro avg 0.28 0.23 0.19 85
weighted avg 0.35 0.27 0.25 85
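The macro average is the unweighted mean of the per-class f1 scores, while the weighted average weights each class by its support. A quick arithmetic check against the report's per-class values:

```python
f1      = [0.17, 0.12, 0.27, 0.39, 0.00]
support = [10, 25, 14, 33, 3]

macro = sum(f1) / len(f1)
weighted = sum(f * s for f, s in zip(f1, support)) / sum(support)
print(round(macro, 2), round(weighted, 2))  # 0.19 0.25
```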
# Initializing the model
clear_session()
LSTM_model = Sequential()
# Embedding layer
LSTM_model.add(Embedding(input_dim= num_words, output_dim= embedding_vector_length,
weights = [embedding_matrix],
trainable = False,
input_length = maxlen))
LSTM_model.output
<tf.Tensor 'embedding/embedding_lookup/Identity_1:0' shape=(None, 133, 200) dtype=float32>
# Adding the Bidirectional LSTM layer
LSTM_model.add(Bidirectional(LSTM(100, return_sequences = True, dropout= 0.4))) #50
# Adding global pooling to make it 1D
LSTM_model.add(GlobalMaxPooling1D())
# Adding dropout to avoid overfitting
LSTM_model.add(Dropout(0.4))
# Adding output layer
LSTM_model.add(Dense(5, activation = 'softmax'))
LSTM_model.output
<tf.Tensor 'dense/Softmax:0' shape=(None, 5) dtype=float32>
LSTM_model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 133, 200) 336000 _________________________________________________________________ bidirectional (Bidirectional (None, 133, 200) 240800 _________________________________________________________________ global_max_pooling1d (Global (None, 200) 0 _________________________________________________________________ dropout (Dropout) (None, 200) 0 _________________________________________________________________ dense (Dense) (None, 5) 1005 ================================================================= Total params: 577,805 Trainable params: 241,805 Non-trainable params: 336,000 _________________________________________________________________
# Compiling the model
LSTM_model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
# Using callbacks to stop training when the loss stops decreasing and accuracy stops improving
early = EarlyStopping(monitor='val_loss', patience=15, verbose=1, min_delta=1, mode='auto')
reduce_learning = ReduceLROnPlateau(patience=15, verbose=1, min_lr=1e-6, factor=0.2)
model_cp = ModelCheckpoint('Industrial_chatbot.h5',monitor='val_loss', save_best_only= True, verbose=1,)
callback_list = [early, reduce_learning, model_cp]
# May be run for more epochs
LSTM_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=32, callbacks=callback_list)
Epoch 1/30 11/11 [==============================] - ETA: 0s - loss: 1.5696 - accuracy: 0.2882 Epoch 00001: val_loss improved from inf to 1.49698, saving model to Industrial_chatbot.h5 11/11 [==============================] - 4s 397ms/step - loss: 1.5696 - accuracy: 0.2882 - val_loss: 1.4970 - val_accuracy: 0.1647 Epoch 2/30 11/11 [==============================] - ETA: 0s - loss: 1.5336 - accuracy: 0.2912 Epoch 00002: val_loss improved from 1.49698 to 1.45338, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 295ms/step - loss: 1.5336 - accuracy: 0.2912 - val_loss: 1.4534 - val_accuracy: 0.3765 Epoch 3/30 11/11 [==============================] - ETA: 0s - loss: 1.4752 - accuracy: 0.3353 Epoch 00003: val_loss improved from 1.45338 to 1.39933, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 309ms/step - loss: 1.4752 - accuracy: 0.3353 - val_loss: 1.3993 - val_accuracy: 0.3882 Epoch 4/30 11/11 [==============================] - ETA: 0s - loss: 1.4325 - accuracy: 0.3559 Epoch 00004: val_loss did not improve from 1.39933 11/11 [==============================] - 3s 250ms/step - loss: 1.4325 - accuracy: 0.3559 - val_loss: 1.4457 - val_accuracy: 0.1765 Epoch 5/30 11/11 [==============================] - ETA: 0s - loss: 1.3902 - accuracy: 0.3824 Epoch 00005: val_loss improved from 1.39933 to 1.36741, saving model to Industrial_chatbot.h5 11/11 [==============================] - 4s 338ms/step - loss: 1.3902 - accuracy: 0.3824 - val_loss: 1.3674 - val_accuracy: 0.4824 Epoch 6/30 11/11 [==============================] - ETA: 0s - loss: 1.3437 - accuracy: 0.4147 Epoch 00006: val_loss improved from 1.36741 to 1.35125, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 309ms/step - loss: 1.3437 - accuracy: 0.4147 - val_loss: 1.3512 - val_accuracy: 0.4118 Epoch 7/30 11/11 [==============================] - ETA: 0s - loss: 1.3047 - accuracy: 0.4441 Epoch 00007: val_loss improved from 
1.35125 to 1.32562, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 288ms/step - loss: 1.3047 - accuracy: 0.4441 - val_loss: 1.3256 - val_accuracy: 0.4706 Epoch 8/30 11/11 [==============================] - ETA: 0s - loss: 1.2708 - accuracy: 0.4529 Epoch 00008: val_loss improved from 1.32562 to 1.29711, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 296ms/step - loss: 1.2708 - accuracy: 0.4529 - val_loss: 1.2971 - val_accuracy: 0.4824 Epoch 9/30 11/11 [==============================] - ETA: 0s - loss: 1.2245 - accuracy: 0.4853 Epoch 00009: val_loss improved from 1.29711 to 1.26461, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 298ms/step - loss: 1.2245 - accuracy: 0.4853 - val_loss: 1.2646 - val_accuracy: 0.5059 Epoch 10/30 11/11 [==============================] - ETA: 0s - loss: 1.1799 - accuracy: 0.5059 Epoch 00010: val_loss improved from 1.26461 to 1.26279, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 302ms/step - loss: 1.1799 - accuracy: 0.5059 - val_loss: 1.2628 - val_accuracy: 0.4471 Epoch 11/30 11/11 [==============================] - ETA: 0s - loss: 1.0956 - accuracy: 0.5647 Epoch 00011: val_loss did not improve from 1.26279 11/11 [==============================] - 3s 286ms/step - loss: 1.0956 - accuracy: 0.5647 - val_loss: 1.3072 - val_accuracy: 0.4235 Epoch 12/30 11/11 [==============================] - ETA: 0s - loss: 1.1319 - accuracy: 0.5324 Epoch 00012: val_loss did not improve from 1.26279 11/11 [==============================] - 3s 251ms/step - loss: 1.1319 - accuracy: 0.5324 - val_loss: 1.2917 - val_accuracy: 0.4588 Epoch 13/30 11/11 [==============================] - ETA: 0s - loss: 1.0320 - accuracy: 0.6059 Epoch 00013: val_loss improved from 1.26279 to 1.25790, saving model to Industrial_chatbot.h5 11/11 [==============================] - 4s 332ms/step - loss: 1.0320 - accuracy: 0.6059 - val_loss: 1.2579 - 
val_accuracy: 0.4588 Epoch 14/30 11/11 [==============================] - ETA: 0s - loss: 1.0536 - accuracy: 0.5824 Epoch 00014: val_loss improved from 1.25790 to 1.25350, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 288ms/step - loss: 1.0536 - accuracy: 0.5824 - val_loss: 1.2535 - val_accuracy: 0.4588 Epoch 15/30 11/11 [==============================] - ETA: 0s - loss: 0.9555 - accuracy: 0.6382 Epoch 00015: val_loss improved from 1.25350 to 1.25350, saving model to Industrial_chatbot.h5 11/11 [==============================] - 3s 282ms/step - loss: 0.9555 - accuracy: 0.6382 - val_loss: 1.2535 - val_accuracy: 0.5412 Epoch 16/30 11/11 [==============================] - ETA: 0s - loss: 0.9619 - accuracy: 0.6353 Epoch 00016: val_loss did not improve from 1.25350 11/11 [==============================] - 3s 286ms/step - loss: 0.9619 - accuracy: 0.6353 - val_loss: 1.2699 - val_accuracy: 0.4235 Epoch 00016: early stopping
<tensorflow.python.keras.callbacks.History at 0x219ef7ffca0>
# Checking the history of the model
plt.plot(LSTM_model.history.history['val_loss']);
# Checking the history of the model
plt.plot(LSTM_model.history.history['val_accuracy']);
# Evaluating the model
test_result = LSTM_model.evaluate(X_test, y_test)
3/3 [==============================] - 0s 46ms/step - loss: 1.2699 - accuracy: 0.4235
print('Test accuracy of the model:{0:.2%}'.format(test_result[1]))
Test accuracy of the model:42.35%
# Printing the performance metrics, reusing the print_confusion_matrix helper defined above
ytest_predict = LSTM_model.predict(X_test)
ytest_pred_binary = ytest_predict>=0.5
print_confusion_matrix(y_test.argmax(axis=1), ytest_predict.argmax(axis=1))
print(classification_report(y_test.argmax(axis=1), ytest_predict.argmax(axis=1), target_names=['1','2','3','4','5']))
precision recall f1-score support
1 0.40 0.40 0.40 10
2 0.50 0.44 0.47 25
3 0.30 0.71 0.43 14
4 0.64 0.27 0.38 33
5 0.33 0.67 0.44 3
accuracy 0.42 85
macro avg 0.44 0.50 0.42 85
weighted avg 0.51 0.42 0.42 85
import fasttext
import csv
# Creating a DataFrame for FastText model
ft_df_Potential = pd.DataFrame(columns=['fasttext_data'])
# Preparing the data for model
ft_df_Potential['fasttext_data'] ='__label__' + ds['Potential Accident Level'].astype(str) + " "+ds['clean_Description'].astype(str)
train_Potential = ft_df_Potential.head(340)
valid_Potential = ft_df_Potential.tail(85)
train_Potential.to_csv(r'train_Potential.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
valid_Potential.to_csv(r'valid_Potential.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
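Each line of the exported text files follows fastText's supervised format: one or more `__label__<class>` tokens, then the text. A self-contained sketch of how one training line is assembled (the label value and description are illustrative):

```python
def to_fasttext_line(label, description):
    # fastText supervised format: each line starts with '__label__<class>'
    return f'__label__{label} {description}'

line = to_fasttext_line(2, 'forklift operator splashed while moving big bag')
print(line)  # __label__2 forklift operator splashed while moving big bag
```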
model = fasttext.train_supervised(input="train_Potential.txt", lr=0.5, epoch=300, wordNgrams=2, bucket=200000, dim=50, loss='ova')
model.predict('forklift went manipulate big bag bioxide section front ladder leads manual displacement splashed spent height forehead fissure pipe subsequently spilling left eye went nearby eyewash cleaning immediately medical center')
(('__label__2',), array([1.00001001]))
model.test("valid_Potential.txt")
(85, 0.4588235294117647, 0.4588235294117647)
ft_model_Potential = fasttext.train_supervised(input="train_Potential.txt", epoch=300)
ft_model_Potential.test("valid_Potential.txt")
(85, 0.4588235294117647, 0.4588235294117647)
# With both lr and epoch tuned
ft_model_Potential = fasttext.train_supervised(input="train_Potential.txt", lr=0.7, epoch=300)
ft_model_Potential.test("valid_Potential.txt")
(85, 0.47058823529411764, 0.47058823529411764)
ft_model_Potential = fasttext.train_supervised(input="train_Potential.txt", lr=0.7, epoch=300, wordNgrams=1)
ft_model_Potential.test("valid_Potential.txt")
(85, 0.47058823529411764, 0.47058823529411764)
ft_model_Potential = fasttext.train_supervised(input="train_Potential.txt", lr=0.7, epoch=300,bucket=200000, dim=50, loss='hs')
ft_model_Potential.test("valid_Potential.txt")
(85, 0.4235294117647059, 0.4235294117647059)
# Tried with the one-vs-all ('ova') loss, which also supports multiple labels
ft_model_Potential = fasttext.train_supervised(input="train_Potential.txt", lr=0.7, epoch=300, bucket=200000, dim=50, loss='ova')
ft_model_Potential.test("valid_Potential.txt")
(85, 0.4470588235294118, 0.4470588235294118)
ft_df_multilabel = pd.DataFrame(columns= ['ft_df_multi'])
ft_df_multilabel['ft_df_multi'] ='__label__' + ds['Accident Level'].astype(str) + " " + '__label__' + ds['Potential Accident Level'].astype(str) + \
" "+ds['clean_Description'].astype(str)
num = np.random.randint(0, ft_df_multilabel.shape[0])
data = ft_df_multilabel.loc[num, 'ft_df_multi']
train_multi = ft_df_multilabel.head(340)
valid_multi = ft_df_multilabel.tail(85)
train_multi.to_csv(r'train_multi.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
valid_multi.to_csv(r'valid_multi.txt', index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
# With both lr and epoch tuned
model = fasttext.train_supervised(input = 'train_multi.txt', lr = 0.7, epoch = 350,loss='ova')
model.test('valid_multi.txt')
(85, 0.7294117647058823, 0.36470588235294116)
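fastText's `test` returns (number of examples, precision@k, recall@k), with k=1 by default. Since every line here carries two true labels but only one label is predicted, recall@1 is capped at 0.5, which explains the gap between 0.73 and 0.36 above. A toy sketch of the two metrics (the example data is invented):

```python
def p_r_at_k(true_labels, predicted):
    """Precision@k and recall@k aggregated over examples, fastText-style."""
    hits = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted))
    n_pred = sum(len(p) for p in predicted)   # total labels predicted
    n_true = sum(len(t) for t in true_labels) # total true labels
    return hits / n_pred, hits / n_true

truth = [{'1', '3'}, {'2', '4'}, {'1', '2'}]
preds = [['1'], ['4'], ['3']]                 # k = 1 prediction per example
print(p_r_at_k(truth, preds))                 # precision 2/3, recall 2/6
```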
num = np.random.randint(0, ft_df_multilabel.shape[0])
data = ft_df_multilabel.loc[num, 'ft_df_multi']
print('Actual with label:',data)
print(' ')
model.predict(data, k =2)
Actual with label: __label__1 __label__3 that carry out the inspection the cut the block level that the loading platform could that the positive radial that be cover and noise from the upper part the pit the center the pit which go back leave work but his metatarsal boot contact with rock that be the floor which lose balance and stumble the gable
(('__label__1', '__label__3'), array([1.00001001, 0.77730989]))
model.save_model('safebot_multi.bin')
We do not have sufficient labelled text data; the dataset contains only 425 records. Training a neural network requires large datasets because the network contains a huge number of parameters, so training such networks on limited data often leads to overfitting and low accuracy.
Implementing SMOTE raised accuracy considerably, with the caveat of overfitting:
Train accuracy of the Random Forest model: 99.82%, Test accuracy: 74.82%
Train accuracy of the Gradient Boosting model: 99.82%, Test accuracy: 65.47%
Train accuracy of the LR model: 99.82%, Test accuracy: 70.50%
Train accuracy of the SVC model: 99.82%, Test accuracy: 72.66%
While implementing BERT we encountered issues such as package incompatibilities, version conflicts, and unstable methods in the source files; in view of this and the time available, we have parked BERT for future research.
We are also aware that NLP models are typically shallower and thus require different fine-tuning methods.
fastText's automatic hyperparameter tuning requires systems with higher computing capacity; 16 GB of RAM was not sufficient to support its computation. Upon reaching out to community support, we learned that a few wrappers written in C++ were not stable (for instance, the autotune.cc wrapper file).
Owing to this challenge, we have parked automatic hyperparameter tuning for future research.